Message classification

ABSTRACT

One or more computing devices, systems, and/or methods for message classification are provided. For example, a set of messages is clustered into a set of clusters. A cluster comprises messages with similar features (e.g., similar subject lines, message body content, sender information, recipient information, structure, user action such as reading or deleting, spam vote information, etc.). Cluster features are computed for the clusters based upon features of messages within such clusters. A first table, comprising cluster entries corresponding to cluster features of clusters, and a second table, comprising message entries corresponding to clusters to which messages are assigned, are created. Message features of a message are created, using the first table and the second table, based upon features of the message and cluster features of clusters to which the message is assigned. A message classifier is used to classify the message (e.g., spam, safe, a threat, etc.) based upon the message features.

BACKGROUND

Many users may exchange messages through messaging services, such as email, social network messaging, text messaging, etc. A message account of a user with a messaging service may provide a messaging interface with various message folders, such as an inbox, a deleted folder, a spam folder, a sent folder, etc. In this way, the user may receive, read, and send messages through the messaging interface, and such messages may be stored within the appropriate message folders. In an example, the user may send work messages to co-workers, personal messages to friends, etc. through the messaging interface. In another example, the user may receive work messages, personal messages, newsletters, advertising messages, etc. through the messaging interface.

Unfortunately, the user may receive unsolicited messages with undesirable content, such as offensive or unwanted content, referred to as spam. Many spam filters may attempt to recognize normal ongoing spam campaigns from spammers who have been around a while, that use the same known malicious content, that use the same internet protocol (IP) addresses that have been blacklisted, etc. However, some spammers may hijack new IP address ranges, use botnets, buy new domains through which to send spam, use randomized malicious content, use variations in subjects, etc. in order to launch unusual and/or sudden spam attacks. Such attacks will not be recognized by spam filters until after harm has been done because, by the time enough spam votes are received for a spam classifier to learn a bad reputation, the spam attack may have already affected numerous users. Because such attacks use randomized or differing subjects, content, sender credentials, etc., no single feature will reveal the complete spam campaign. Thus, security of users may become compromised, such as where users read and/or take action (e.g., click on a link to malicious content) with regard to the spam. Also, users may become frustrated when their inbox is inundated with such spam.

SUMMARY

In accordance with the present disclosure, one or more computing devices and/or methods for message classification are provided. For example, a set of messages may be transformed into a set of bags of words. A bag of words may correspond to a message component (e.g., a subject line bag of words comprising words within a subject of a message; an email body bag of words comprising words within an email body of the message; etc.). The set of bags of words may be transformed into a set of hash descriptions using a hash function. For example, each bag of words may be transformed into a min-hash description based upon a number of words (e.g., 5 words or any other number of words) having a minimum hash value. The set of hash descriptions may be grouped (e.g., using a map-reduce framework for improved computation processing) to cluster messages of the set of messages into a set of clusters. The set of clusters may cluster messages according to a subject line space, a sender space, a content space, a user action space, an extensible markup language (XML) document object model (DOM) structure space for message body content, etc. For example, messages with similar content may be grouped into a first cluster, messages with similar subject lines may be grouped into a second cluster, messages that users deleted before reading may be grouped into a third cluster, etc.

Cluster features may be computed for clusters within the set of clusters based upon features of messages within the clusters. A cluster feature may correspond to an aggregation of spam scores for messages within a cluster, recipient characteristic features of recipients of the messages (e.g., age, location, social network profile data, a frequency of accessing messages, etc.), user action features performed upon the messages by the recipients (e.g., reading a message, deleting a message without first reading the message, voting a message as spam, forwarding a message, replying to a message, and/or other actions indicating whether the user had interest in the message or might have felt the message was not desirable such as spam), message content features of the messages (e.g., grammar, spelling, word choice, embedded links, embedded images, a topic of the message content, word count, paragraph/sentence structure, etc.), attachment features of the messages (e.g., a topic of an attachment), subject line features of the messages (e.g., spelling, grammar, sentence structure, word count, etc.), etc.

A first table comprising cluster entries for the set of clusters may be created. The first table may comprise a first cluster entry corresponding to a first cluster and cluster features of the first cluster. A second table comprising message entries for the set of messages may be created. The second table may comprise a first message entry corresponding to a first message and identifiers of clusters to which the first message is assigned. Message features for a message (e.g., a message within a message inbox) may be created, using the first table and the second table (e.g., the second table may be used to identify clusters to which the message is assigned, and the first table may be used to identify cluster features of those clusters to which the message is assigned, which may be used as message features), based upon features of the message and cluster features of clusters to which the message is assigned. A message classifier (e.g., trained to use a learned decision rule for classifying messages as spam, a threat, personal, work, etc. that is learned based upon a set of training message data labeled based upon spam vote information for the set of training messages) may be used to classify the message based upon the message features. For example, the message may be retroactively classified (e.g., classified and moved to a spam folder after the message was already delivered to the message inbox). In another example, clusters of messages may be retained and updated as new messages are intercepted by a messaging service, and thus a new message may be classified and/or filtered before reaching the message inbox.

DESCRIPTION OF THE DRAWINGS

While the techniques presented herein may be embodied in alternative forms, the particular embodiments illustrated in the drawings are only a few examples that are supplemental of the description provided herein. These embodiments are not to be interpreted in a limiting manner, such as limiting the claims appended hereto.

FIG. 1 is an illustration of a scenario involving various examples of networks that may connect servers and clients.

FIG. 2 is an illustration of a scenario involving an example configuration of a server that may utilize and/or implement at least a portion of the techniques presented herein.

FIG. 3 is an illustration of a scenario involving an example configuration of a client that may utilize and/or implement at least a portion of the techniques presented herein.

FIG. 4 is a flow chart illustrating an example method for message classification.

FIG. 5A is a component block diagram illustrating an example system for message classification, where a message classifier is trained.

FIG. 5B is a component block diagram illustrating an example system for message classification, where hash descriptions are created from a set of messages.

FIG. 5C is a component block diagram illustrating an example system for message classification, where messages are clustered to create a set of clusters.

FIG. 5D is a component block diagram illustrating an example system for message classification, where a first table and a second table are created.

FIG. 5E is a component block diagram illustrating an example system for message classification.

FIG. 6 is an illustration of a scenario featuring an example non-transitory machine readable medium in accordance with one or more of the provisions set forth herein.

DETAILED DESCRIPTION

Subject matter will now be described more fully hereinafter with reference to the accompanying drawings, which form a part hereof, and which show, by way of illustration, specific example embodiments. This description is not intended as an extensive or detailed discussion of known concepts. Details that are known generally to those of ordinary skill in the relevant art may have been omitted, or may be handled in summary fashion.

The following subject matter may be embodied in a variety of different forms, such as methods, devices, components, and/or systems. Accordingly, this subject matter is not intended to be construed as limited to any example embodiments set forth herein. Rather, example embodiments are provided merely to be illustrative. Such embodiments may, for example, take the form of hardware, software, firmware or any combination thereof.

1. Computing Scenario

The following provides a discussion of some types of computing scenarios in which the disclosed subject matter may be utilized and/or implemented.

1.1. Networking

FIG. 1 is an interaction diagram of a scenario 100 illustrating a service 102 provided by a set of servers 104 to a set of client devices 110 via various types of networks. The servers 104 and/or client devices 110 may be capable of transmitting, receiving, processing, and/or storing many types of signals, such as in memory as physical memory states.

The servers 104 of the service 102 may be internally connected via a local area network 106 (LAN), such as a wired network where network adapters on the respective servers 104 are interconnected via cables (e.g., coaxial and/or fiber optic cabling), and may be connected in various topologies (e.g., buses, token rings, meshes, and/or trees). The servers 104 may be interconnected directly, or through one or more other networking devices, such as routers, switches, and/or repeaters. The servers 104 may utilize a variety of physical networking protocols (e.g., Ethernet and/or Fibre Channel) and/or logical networking protocols (e.g., variants of an Internet Protocol (IP), a Transmission Control Protocol (TCP), and/or a User Datagram Protocol (UDP)). The local area network 106 may include, e.g., analog telephone lines, such as a twisted wire pair, a coaxial cable, full or fractional digital lines including T1, T2, T3, or T4 type lines, Integrated Services Digital Networks (ISDNs), Digital Subscriber Lines (DSLs), wireless links including satellite links, or other communication links or channels, such as may be known to those skilled in the art. The local area network 106 may be organized according to one or more network architectures, such as server/client, peer-to-peer, and/or mesh architectures, and/or a variety of roles, such as administrative servers, authentication servers, security monitor servers, data stores for objects such as files and databases, business logic servers, time synchronization servers, and/or front-end servers providing a user-facing interface for the service 102.

Likewise, the local area network 106 may comprise one or more sub-networks, such as may employ differing architectures, may be compliant or compatible with differing protocols and/or may interoperate within the local area network 106. Additionally, a variety of local area networks 106 may be interconnected; e.g., a router may provide a link between otherwise separate and independent local area networks 106.

In the scenario 100 of FIG. 1, the local area network 106 of the service 102 is connected to a wide area network 108 (WAN) that allows the service 102 to exchange data with other services 102 and/or client devices 110. The wide area network 108 may encompass various combinations of devices with varying levels of distribution and exposure, such as a public wide-area network (e.g., the Internet) and/or a private network (e.g., a virtual private network (VPN) of a distributed enterprise).

In the scenario 100 of FIG. 1, the service 102 may be accessed via the wide area network 108 by a user 112 of one or more client devices 110, such as a portable media player (e.g., an electronic text reader, an audio device, or a portable gaming, exercise, or navigation device); a portable communication device (e.g., a camera, a phone, a wearable or a text chatting device); a workstation; and/or a laptop form factor computer. The respective client devices 110 may communicate with the service 102 via various connections to the wide area network 108. As a first such example, one or more client devices 110 may comprise a cellular communicator and may communicate with the service 102 by connecting to the wide area network 108 via a wireless local area network 106 provided by a cellular provider. As a second such example, one or more client devices 110 may communicate with the service 102 by connecting to the wide area network 108 via a wireless local area network 106 provided by a location such as the user's home or workplace (e.g., a WiFi (Institute of Electrical and Electronics Engineers (IEEE) Standard 802.11) network or a Bluetooth (IEEE Standard 802.15.1) personal area network). In this manner, the servers 104 and the client devices 110 may communicate over various types of networks. Other types of networks that may be accessed by the servers 104 and/or client devices 110 include mass storage, such as network attached storage (NAS), a storage area network (SAN), or other forms of computer or machine readable media.

1.2. Server Configuration

FIG. 2 presents a schematic architecture diagram 200 of a server 104 that may utilize at least a portion of the techniques provided herein. Such a server 104 may vary widely in configuration or capabilities, alone or in conjunction with other servers, in order to provide a service such as the service 102.

The server 104 may comprise one or more processors 210 that process instructions. The one or more processors 210 may optionally include a plurality of cores; one or more coprocessors, such as a mathematics coprocessor or an integrated graphical processing unit (GPU); and/or one or more layers of local cache memory. The server 104 may comprise memory 202 storing various forms of applications, such as an operating system 204; one or more server applications 206, such as a hypertext transfer protocol (HTTP) server, a file transfer protocol (FTP) server, or a simple mail transfer protocol (SMTP) server; and/or various forms of data, such as a database 208 or a file system. The server 104 may comprise a variety of peripheral components, such as a wired and/or wireless network adapter 214 connectible to a local area network and/or wide area network; one or more storage components 216, such as a hard disk drive, a solid-state storage device (SSD), a flash memory device, and/or a magnetic and/or optical disk reader.

The server 104 may comprise a mainboard featuring one or more communication buses 212 that interconnect the processor 210, the memory 202, and various peripherals, using a variety of bus technologies, such as a variant of a serial or parallel AT Attachment (ATA) bus protocol; a Universal Serial Bus (USB) protocol; and/or Small Computer System Interface (SCSI) bus protocol. In a multibus scenario, a communication bus 212 may interconnect the server 104 with at least one other server. Other components that may optionally be included with the server 104 (though not shown in the schematic architecture diagram 200 of FIG. 2) include a display; a display adapter, such as a graphical processing unit (GPU); input peripherals, such as a keyboard and/or mouse; and a flash memory device that may store a basic input/output system (BIOS) routine that facilitates booting the server 104 to a state of readiness.

The server 104 may operate in various physical enclosures, such as a desktop or tower, and/or may be integrated with a display as an “all-in-one” device. The server 104 may be mounted horizontally and/or in a cabinet or rack, and/or may simply comprise an interconnected set of components. The server 104 may comprise a dedicated and/or shared power supply 218 that supplies and/or regulates power for the other components. The server 104 may provide power to and/or receive power from another server and/or other devices. The server 104 may comprise a shared and/or dedicated climate control unit 220 that regulates climate properties, such as temperature, humidity, and/or airflow. Many such servers 104 may be configured and/or adapted to utilize at least a portion of the techniques presented herein.

1.3. Client Device Configuration

FIG. 3 presents a schematic architecture diagram 300 of a client device 110 whereupon at least a portion of the techniques presented herein may be implemented. Such a client device 110 may vary widely in configuration or capabilities, in order to provide a variety of functionality to a user such as the user 112. The client device 110 may be provided in a variety of form factors, such as a desktop or tower workstation; an “all-in-one” device integrated with a display 308; a laptop, tablet, convertible tablet, or palmtop device; a wearable device mountable in a headset, eyeglass, earpiece, and/or wristwatch, and/or integrated with an article of clothing; and/or a component of a piece of furniture, such as a tabletop, and/or of another device, such as a vehicle or residence. The client device 110 may serve the user in a variety of roles, such as a workstation, kiosk, media player, gaming device, and/or appliance.

The client device 110 may comprise one or more processors 310 that process instructions. The one or more processors 310 may optionally include a plurality of cores; one or more coprocessors, such as a mathematics coprocessor or an integrated graphical processing unit (GPU); and/or one or more layers of local cache memory. The client device 110 may comprise memory 301 storing various forms of applications, such as an operating system 303; one or more user applications 302, such as document applications, media applications, file and/or data access applications, communication applications such as web browsers and/or email clients, utilities, and/or games; and/or drivers for various peripherals. The client device 110 may comprise a variety of peripheral components, such as a wired and/or wireless network adapter 306 connectible to a local area network and/or wide area network; one or more output components, such as a display 308 coupled with a display adapter (optionally including a graphical processing unit (GPU)), a sound adapter coupled with a speaker, and/or a printer; input devices for receiving input from the user, such as a keyboard 311, a mouse, a microphone, a camera, and/or a touch-sensitive component of the display 308; and/or environmental sensors, such as a global positioning system (GPS) receiver 319 that detects the location, velocity, and/or acceleration of the client device 110, a compass, accelerometer, and/or gyroscope that detects a physical orientation of the client device 110. Other components that may optionally be included with the client device 110 (though not shown in the schematic architecture diagram 300 of FIG. 3) include one or more storage components, such as a hard disk drive, a solid-state storage device (SSD), a flash memory device, and/or a magnetic and/or optical disk reader; and/or a flash memory device that may store a basic input/output system (BIOS) routine that facilitates booting the client device 110 to a state of readiness; and a climate control unit that regulates climate properties, such as temperature, humidity, and airflow.

The client device 110 may comprise a mainboard featuring one or more communication buses 312 that interconnect the processor 310, the memory 301, and various peripherals, using a variety of bus technologies, such as a variant of a serial or parallel AT Attachment (ATA) bus protocol; the Universal Serial Bus (USB) protocol; and/or the Small Computer System Interface (SCSI) bus protocol. The client device 110 may comprise a dedicated and/or shared power supply 318 that supplies and/or regulates power for other components, and/or a battery 304 that stores power for use while the client device 110 is not connected to a power source via the power supply 318. The client device 110 may provide power to and/or receive power from other client devices.

In some scenarios, as a user 112 interacts with a software application on a client device 110 (e.g., an instant messenger and/or electronic mail application), descriptive content in the form of signals or stored physical states within memory (e.g., an email address, instant messenger identifier, phone number, postal address, message content, date, and/or time) may be identified. Descriptive content may be stored, typically along with contextual content. For example, the source of a phone number (e.g., a communication received from another user via an instant messenger application) may be stored as contextual content associated with the phone number. Contextual content, therefore, may identify circumstances surrounding receipt of a phone number (e.g., the date or time that the phone number was received), and may be associated with descriptive content. Contextual content may, for example, be used to subsequently search for associated descriptive content. For example, a search for phone numbers received from specific individuals, received via an instant messenger application or at a given date or time, may be initiated. The client device 110 may include one or more servers that may locally serve the client device 110 and/or other client devices of the user 112 and/or other individuals. For example, a locally installed webserver may provide web content in response to locally submitted web requests. Many such client devices 110 may be configured and/or adapted to utilize at least a portion of the techniques presented herein.

2. Presented Techniques

One or more computing devices and/or techniques for message classification are provided. For example, a user may send, receive, and read messages through a message interface associated with a message service. The message service may provide the user with the ability to mark/vote messages as spam, and move such messages to a spam folder. The message service may also employ a spam filter that may attempt to recognize normal ongoing spam campaigns from spammers who have been around a while, that use the same known malicious content, that use the same internet protocol (IP) addresses that have been blacklisted, etc. However, these spam filters are unable to recognize certain spam campaigns such as where a spammer has hijacked new IP address ranges, uses botnets, buys new domains through which to send spam, uses randomized malicious content, uses variations in subjects, etc. in order to launch unusual and/or sudden spam campaigns/attacks. Such spam campaigns will not be recognized by the spam filters until after harm has been done because, by the time enough spam votes are received for a spam classifier to learn a bad reputation, the spam campaign may have already affected numerous users. No single feature (e.g., a common subject line, common email body content, common sender credentials, etc.) will reveal the complete spam campaign because such spam campaigns use randomized or differing subjects, content, sender credentials, etc. Thus, security of users may become compromised such as where users read and/or take action upon a spam message (e.g., click on a link to malicious content). Also, users may waste time, computing resources, and/or bandwidth accessing the message inbox populated with spam messages that are uninteresting and/or a threat to the user. The message interface may become so cluttered with spam messages that the message interface becomes unwieldy for the user (e.g., the user may have to sift through hundreds of spam emails in order to identify emails of interest).

Accordingly, as provided herein, a campaign detector and a machine learning classifier may be used to discriminate between spam campaigns and other types of messages (e.g., work messages, personal messages, desirable ad campaigns, etc.). In particular, multi-level clustering is used to identify messages that make up a campaign, and a message classifier is used to classify messages based upon features of messages and cluster features of clusters into which such messages are clustered. Because a variety of feature aspects (e.g., spam scores of messages within a cluster, subject lines of messages within the cluster, message body content of messages within the cluster, actions users performed upon messages within the cluster, etc.) are used, difficult-to-detect spam messages and/or campaigns can be quickly and efficiently detected notwithstanding a spammer using randomized and/or differing subject lines, content, sender credentials, etc. Message classification can be expanded to millions or even billions of messages by using a locality sensitive hashing method for scalable clustering with linear complexity and/or using a map-reduce framework.

An embodiment of classifying messages is illustrated by an example method 400 of FIG. 4. In an example, the method 400 may be implemented by a computing device or a collection of computing devices such as within a computer cluster environment, a cloud computing environment, a distributed network environment, etc. A set of messages, such as emails (e.g., emails that were previously delivered to users or emails that are yet to be delivered to users), may be evaluated for classification (e.g., as spam, safe, work related, personal, newsletter, a threat, etc.) and/or identification of spam campaigns. At 402, the set of messages may be clustered into a set of clusters. In particular, the set of messages may be transformed into a set of bags of words. A bag of words may correspond to a message component (e.g., a first bag of words may comprise words from a subject line of a message, a second bag of words may comprise words from a message body of the message, a third bag of words may comprise information regarding an XML DOM structure space for message body content of the message, a fourth bag of words may comprise recipient information of a recipient of the message, such as an age, gender, occupation, social network profile information, email usage information, and/or any other public information or information for which the recipient provides affirmative consent that such information may be used such as for message classification and/or filtering, etc.).
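The transformation of a message into per-component bags of words may be sketched as follows. This is a minimal illustration only: the function name `bags_of_words`, the component keys, and the tokenization rule (lowercase alphanumeric runs) are assumptions made for the example, and each bag is kept as a set of distinct words rather than as word counts.

```python
import re
from typing import Dict, Set


def bags_of_words(message: Dict[str, str]) -> Dict[str, Set[str]]:
    """Build one bag (here, a set of distinct lowercase tokens) per message component.

    The component names and the tokenizer are illustrative assumptions.
    """
    bags: Dict[str, Set[str]] = {}
    for component, text in message.items():
        # Split on anything that is not a lowercase letter, digit, or apostrophe.
        bags[component] = set(re.findall(r"[a-z0-9']+", text.lower()))
    return bags


# Hypothetical message with a subject line component and a body component.
message = {
    "subject": "Win a FREE prize today",
    "body": "Click the link to claim your free prize.",
}
bags = bags_of_words(message)
```

Keeping one bag per component allows the same message to be compared with other messages separately in a subject line space and in a message body space.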

The set of bags of words may be transformed into a set of hash descriptions using a hash function. For example, the hash function may be used to transform a bag of words into a min-hash description based upon a number of words (e.g., 5 words or any other number of words) having a minimum hash value (e.g., a locality sensitive hashing method with a min-hash trick, which can be used for clustering messages).
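One possible reading of the min-hash description is sketched below: every word of a bag is hashed, and the words with the smallest hash values are retained as the description. The helper names `word_hash` and `min_hash_description` and the choice of SHA-1 are assumptions for illustration; the disclosure does not name a particular hash function.

```python
import hashlib
from typing import Iterable, Tuple


def word_hash(word: str) -> int:
    """Stable 64-bit hash of a word.

    hashlib is used because Python's built-in hash() of strings is salted
    per process and would not give repeatable descriptions.
    """
    return int.from_bytes(hashlib.sha1(word.encode("utf-8")).digest()[:8], "big")


def min_hash_description(bag: Iterable[str], k: int = 5) -> Tuple[str, ...]:
    """Return up to k distinct words with the smallest hash values, ordered by hash."""
    return tuple(sorted(set(bag), key=word_hash)[:k])


# Hypothetical subject-line bag.
description = min_hash_description(
    ["win", "a", "free", "prize", "today", "now", "click"], k=5
)
```

Because the description depends only on which words are present, two bags that share most of their words tend to share most or all of the retained words, which is what makes the description usable as a grouping key.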

The set of hash descriptions may be grouped (e.g., using a map-reduce framework that parcels out tasks to multiple nodes of a cluster, and organizes results from the nodes into a cohesive output, which can reduce processing time for tasks such as listing and/or counting a number of times words appear within messages or hash descriptions) to cluster messages of the set of messages into a set of clusters. For example, if a first hash description for a first subject line of a first message and a second hash description for a second subject line of a second message are within a similarity threshold, then the first message and the second message (e.g., the messages associated with bags of words that were transformed into the two similar hash descriptions) may be clustered together.
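The grouping step may be sketched in a map-reduce style as shown below. This is a single-process illustration, not a distributed implementation: the function names `map_phase` and `reduce_phase` are hypothetical, and the sketch assumes the strictest similarity threshold, in which only identical hash descriptions are grouped together. A looser threshold would require a different grouping key or a pairwise comparison.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Description = Tuple[str, ...]


def map_phase(descriptions: Dict[str, Description]) -> Iterator[Tuple[Description, str]]:
    """Emit (hash description, message identifier) pairs, keyed by description."""
    for message_id, description in descriptions.items():
        yield description, message_id


def reduce_phase(pairs: Iterable[Tuple[Description, str]]) -> Dict[Description, List[str]]:
    """Collect the message identifiers that share a key into one cluster per key."""
    groups: Dict[Description, List[str]] = defaultdict(list)
    for key, message_id in pairs:
        groups[key].append(message_id)
    return {key: sorted(ids) for key, ids in groups.items()}


# Hypothetical subject-line hash descriptions for three messages.
descriptions = {
    "m1": ("free", "prize", "win"),
    "m2": ("free", "prize", "win"),
    "m3": ("agenda", "meeting"),
}
clusters = reduce_phase(map_phase(descriptions))
```

Grouping by key touches each message once, which is consistent with the linear complexity attributed to the locality sensitive hashing approach.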

A message may be clustered into a single cluster or multiple clusters (e.g., a message may be clustered into a first cluster with other messages having similar subject lines, and the message may also be clustered into a second cluster with other messages having similar XML DOM structures for message body content). The messages may be clustered within a sender description space (e.g., messages sent by senders with similar attributes such as IP addresses may be clustered together), a subject line space (e.g., messages with similar subject lines may be clustered together), a message body content space (e.g., messages with similar message body content such as similar words, sentence structure, grammar, vocabulary, images, or other embedded content may be clustered together), a user action space (e.g., messages may be clustered based upon having similar user read action features, user reply action features, user forward action features, user delete action features, and/or user spam vote features, such as where messages that are deleted without being read may be clustered together), and/or an XML DOM structure space for message body content (e.g., messages with similar Hypertext Markup Language (HTML) or other markup language attributes may be clustered together). In this way, messages may be clustered into clusters based upon a variety of features.

At 404, cluster features may be computed for clusters within the set of clusters based upon features of messages within the clusters. In an example, a cluster feature may correspond to a distance (e.g., a diameter of the cluster, such as a max Hadamard distance between two bags of words within the cluster) between a first bag of words of a message within a cluster and a second bag of words of another message within the cluster. In another example, a number of messages within the cluster may be measured as a cluster feature. In another example, a cluster feature may correspond to an aggregation of spam scores for messages within the cluster, recipient characteristic features of recipients of the messages, user action features performed upon the messages by the recipients, message content features of the messages, subject line features of the messages, etc.
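Three of the cluster features described above (cluster size, cluster diameter, and an aggregated spam score) may be sketched as follows. The distance measure is an assumption: Jaccard distance between bags of words is swapped in purely for illustration and is not the distance named in the text. The function names and the use of a mean as the aggregation are likewise hypothetical.

```python
from itertools import combinations
from statistics import mean
from typing import Dict, List, Set


def jaccard_distance(a: Set[str], b: Set[str]) -> float:
    """1 - |a & b| / |a | b|; an illustrative stand-in distance between two bags."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def cluster_features(bags: List[Set[str]], spam_scores: List[float]) -> Dict[str, float]:
    """Compute size, diameter (largest pairwise distance), and mean spam score."""
    diameter = max(
        (jaccard_distance(a, b) for a, b in combinations(bags, 2)),
        default=0.0,  # a cluster with fewer than two messages has zero diameter
    )
    return {
        "size": len(bags),
        "diameter": diameter,
        "mean_spam_score": mean(spam_scores) if spam_scores else 0.0,
    }


# Hypothetical cluster of three subject-line bags with per-message spam scores.
features = cluster_features(
    [{"win", "free", "prize"}, {"win", "free", "cash"}, {"win", "free", "prize"}],
    [0.9, 0.7, 0.8],
)
```

A small diameter together with a large size would describe many near-identical messages, which is the kind of signal a later classification step can use.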

At 406, a first table, comprising cluster entries for the set of clusters, may be created. For example, a first cluster entry may correspond to a first cluster (e.g., identified by a cluster identifier) and cluster features of the first cluster. At 408, a second table, comprising message entries for the set of messages, may be created. For example, the second table may comprise a first message entry corresponding to a first message (e.g., identified by a message identifier) and identifiers of clusters to which the first message is assigned. In this way, the second table may be queried using the message identifier to identify the first message entry comprising the identifiers of clusters to which the first message is assigned. The identifiers of the clusters may be used to query the first table in order to identify cluster features of the clusters to which the first message is assigned so that the cluster features can be used for creating message features for the first message.
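The two-table lookup may be sketched with in-memory dictionaries as follows. The identifiers, feature names, and values are invented for illustration, and a production system would presumably use a persistent store rather than Python dictionaries.

```python
from typing import Dict, List

# First table: cluster identifier -> cluster features (hypothetical values).
cluster_table: Dict[str, Dict[str, float]] = {
    "subject:17": {"size": 5000, "mean_spam_score": 0.82},
    "dom:4": {"size": 12000, "mean_spam_score": 0.64},
}

# Second table: message identifier -> identifiers of clusters to which
# the message is assigned (hypothetical values).
message_table: Dict[str, List[str]] = {
    "msg-001": ["subject:17", "dom:4"],
    "msg-002": ["dom:4"],
}


def cluster_features_for(message_id: str) -> List[Dict[str, float]]:
    """Query the second table by message, then the first table by cluster."""
    return [cluster_table[cid] for cid in message_table.get(message_id, [])]


features_for_first = cluster_features_for("msg-001")
```

Separating the tables means cluster features are stored once per cluster rather than once per message, so updating a cluster updates the features seen by every message assigned to it.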

At 410, message features may be created for a message (e.g., a message that has been delivered to a message inbox, a message that has yet to be delivered to a user's message account, a message residing in any message container, such as a sent folder, an outbox, a deleted folder, a spam folder where a message previously marked as spam may be retroactively classified as non-spam and moved to the message inbox, etc.). The message features may be based upon features of the message (e.g., a subject line feature, a message body feature, an XML DOM structure feature, a sender feature, a recipient feature, and/or other features such as a topic of the message as determined by textual features within the subject line and message body, a topic of content linked to by the message, a topic of an attachment of the message, etc.) and/or cluster features of the clusters to which the message is assigned.

A message feature may be created based upon a number of messages within a cluster to which the message is assigned (e.g., a large number of very similar messages within a cluster may be indicative of a campaign as opposed to personalized messages). The message feature may be created based upon user actions upon messages within the cluster (e.g., a cluster comprising messages where users delete the messages before reading them may be indicative of uninteresting or spam messages). The message feature may be created based upon subject line features of messages within the cluster (e.g., improper grammar, spelling, and the use of special characters may be indicative of spam). The message feature may be created based upon sender description features of messages within the cluster (e.g., senders using blacklisted IP addresses or domains may be indicative of spammers). The message feature may be created based upon word content of message bodies, attachments, links, etc. of the messages within the cluster (e.g., grammar, spelling, sentence structure, paragraph structure, spam key words, etc.). The message feature may be created based upon message body structures of the messages within the cluster (e.g., XML DOM structure). The message feature may be created based upon spam filter scores for the messages within the cluster (e.g., a spam filter may be used to score the messages as to how likely such messages are spam). Because the message may be assigned to multiple clusters, the message feature may be generated based upon an aggregate of features of the multiple clusters to which the message is assigned.
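
One way to aggregate the features of multiple clusters into message features is sketched below. The choice of aggregates (a maximum cluster size and a mean spam score) and all key names are assumptions for illustration; the description does not prescribe a particular aggregation.

```python
def create_message_features(own_features, cluster_feature_list):
    """Combine a message's own features with aggregates over its clusters.

    own_features: dict of features of the message itself (hypothetical keys).
    cluster_feature_list: list of dicts with hypothetical keys 'size' and
    'mean_spam_score', one dict per cluster to which the message is assigned.
    """
    features = dict(own_features)  # copy so the caller's dict is not modified
    if cluster_feature_list:
        sizes = [c["size"] for c in cluster_feature_list]
        scores = [c["mean_spam_score"] for c in cluster_feature_list]
        features["max_cluster_size"] = max(sizes)
        features["mean_cluster_spam_score"] = sum(scores) / len(scores)
    return features


message_features = create_message_features(
    {"subject_has_special_chars": 1},
    [
        {"size": 120, "mean_spam_score": 0.8},
        {"size": 40, "mean_spam_score": 0.4},
    ],
)
```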

At 412, a message classifier may be used to classify the message based upon the message feature. In an example, the message may be retroactively classified after the message was delivered to a message inbox or may be classified before potential delivery to a user. In an example, if the message is classified as spam, then the message may be moved from the message inbox to a spam folder. The message classifier may be used to identify a spam campaign based upon one or more clusters comprising messages classified as spam (e.g., subject lines, message bodies, attachments, links, sender addresses, etc. that are used as part of the spam campaign may be identified as correlated together to define the spam campaign).

The message classifier may be trained based upon message features of training messages (e.g., trained to classify messages as spam, non-spam, work, personal, advertising, etc. based upon subject line features, user action features, sender description features, word count of message bodies, message body structures such as XML DOM structures, etc.). In an example, the message classifier may be trained to use a learned decision rule (e.g., a rule that evaluates message features to determine whether to classify a message as spam or not, such as where certain message features and values of message features may be weighted to identify a score that is evaluated against a threshold indicative of a certain classification) that is learned using a set of training message data labeled based upon spam indicators. A spam indicator may comprise spam vote information, whether a training message was located within a bulk folder such as a spam folder, whether a training message was retrospectively filtered by another algorithm such as another spam filter, etc. For example, if a training message receives a threshold number of spam votes within a threshold time (e.g., 50 spam votes within 10 days of the training message being sent), then the training message may be designated as spam (e.g., message features of the training message may be used to teach the message classifier as to what message features are indicative of spam). Otherwise, the training message is designated as not spam (e.g., message features of the training message may be used to teach the message classifier as to what message features are indicative of non-spam messages). In an example, a number of positive spam training message examples may be balanced with a number of negative non-spam training message examples by subsampling a ratio of negative non-spam training message examples (e.g., subsample a ratio of 1:500 from negative non-spam training messages).
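
The labeling and balancing of training message data may be sketched as follows. The function names and parameters are hypothetical, and the reading of the 1:500 ratio as keeping roughly one of every 500 negative examples is an assumption; the defaults of 50 votes and 10 days are taken from the example above.

```python
import random


def label_training_message(vote_days, vote_threshold=50, window_days=10):
    """Return True (spam) if enough spam votes arrive within the time window.

    vote_days: for each spam vote, the number of days after the training
    message was sent at which the vote was received (hypothetical format).
    """
    votes_in_window = sum(1 for day in vote_days if day <= window_days)
    return votes_in_window >= vote_threshold


def balance_examples(positives, negatives, keep_ratio=1 / 500, seed=0):
    """Subsample negative examples; keep_ratio is an assumed reading of 1:500."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    keep = max(1, round(len(negatives) * keep_ratio)) if negatives else 0
    return positives, rng.sample(negatives, keep)


is_spam = label_training_message([1] * 50)
positives, sampled_negatives = balance_examples(["p1"], list(range(1000)))
```

Subsampling only the negatives leaves every labeled spam example available for training while reducing the dominance of non-spam examples.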

FIGS. 5A-5E illustrate examples of a system 500 for classifying messages. FIG. 5A illustrates a classifier trainer functionality module 510 configured to train a message classifier 502. The classifier trainer functionality module 510 may acquire training message data 504. For example, messages 506 over a first timespan (e.g., messages received by a messaging service over a 6 hour timespan or any other timespan) may be retrieved. Spam vote data 508 (e.g., users marking messages as spam, users moving messages into a spam folder, or other actions indicative of users treating messages as spam) over a second timespan (e.g., a 10 day timespan or any other timespan from when the messages were sent) may be retrieved. The messages 506 and spam votes 508 may be used as the training message data 504 by the classifier trainer functionality module 510 to train the message classifier 502 to use learned decision rules 514 for classifying messages, resulting in a trained message classifier 512. For example, the spam votes 508 may indicate that a threshold number of spam votes were received for a message within the messages 506 of the training message data 504 (e.g., 200 spam votes). A learned decision rule may be created to classify messages as spam where such messages have similar message features as the message for which the threshold number of spam votes was received. In this way, messages with similar message features as the message may be classified by the trained message classifier 512 as spam (e.g., the trained message classifier 512 is trained to classify messages as spam, non-spam, work, personal, advertising, etc. based upon subject line features, user action features, sender description features, word count of message bodies, message body structures such as XML DOM structures, and/or other message features of messages having threshold numbers of spam votes 508).

FIG. 5B illustrates a set of hash descriptions 530 being created based upon a set of messages. For example, the set of messages may comprise a first message 520 with a subject line, sender info (e.g., IP address, domain, etc.), recipient info (e.g., frequency of email service usage, IP address, domain, age, gender, social network profile information, etc.), user action for the first message 520 (e.g., did the user read, delete, mark as spam, reply, forward, or perform other actions upon the first message 520), message body content or attached content (e.g., text, links, an embedded image, an attachment, etc.), an XML DOM structure of message body of the first message 520 (e.g., HTML features), etc. The set of messages may comprise a plurality of messages, such as the first message 520, an nth message 522, etc.

The set of messages may be transformed into bags of words 524, where a bag of words corresponds to a message component (e.g., a subject line, sender information, recipient information, user action information, message body content, XML DOM structure information, etc.). For example, a first bag of words 526 comprises words or other content within a subject line of the first message 520. A second bag of words 528 comprises words or other content of a message body of the first message 520. A third bag of words may comprise words or other content of a subject line of the nth message 522.

The bags of words 524 may be transformed into a set of hash descriptions 530 using a hash function. For example, the hash function may be used to transform the first bag of words 526 into a first hash description 532, such as a min-hash description based upon a number of words (e.g., 5 words or any other number of words) having a minimum hash value. Similarly, the second bag of words 528 may be transformed into a second hash description 534.
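
A min-hash description of this kind may be sketched as selecting the words of a bag that have the smallest hash values. The use of SHA-1 as the hash function and the function names below are assumptions; any stable hash function could serve.

```python
import hashlib


def stable_hash(word):
    """Map a word to an integer with a hash that is stable across runs (SHA-1 assumed)."""
    digest = hashlib.sha1(word.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def min_hash_description(bag_of_words, k=5):
    """Return the k distinct words of the bag having the minimum hash values."""
    unique_words = set(bag_of_words)
    return tuple(sorted(unique_words, key=stable_hash)[:k])


subject = ["win", "a", "free", "prize", "today", "click", "now"]
description = min_hash_description(subject)
```

Because the description depends only on which words are present, reordering or repeating words in the subject line yields the same description, and two bags that share most of their words tend to share most of their description.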

FIG. 5C illustrates a clustering functionality module 538 used to group the set of hash descriptions 530 to cluster the set of messages into a set of clusters 540. For example, if a hash description for a subject line of a message and a hash description for a subject line of a second message are within a similarity threshold, then the message and the second message (e.g., the messages associated with bags of words that were transformed into the two similar hash descriptions) may be clustered together. In an example, messages may be clustered into one or more clusters corresponding to sender description space, one or more clusters corresponding to a subject line space (e.g., messages with similar subject line features such as words, symbols, grammar, etc. may be clustered together), one or more clusters corresponding to a message body content space (e.g., messages with similar message body features such as words, links, grammar, etc. may be clustered together), one or more clusters corresponding to user action space (e.g., messages that are deleted without being read may be clustered together), one or more clusters corresponding to an XML DOM structure space (e.g., messages with similar HTML features may be clustered together), etc.
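
Grouping hash descriptions that fall within a similarity threshold may be sketched with a simple greedy pass over a single space (e.g., the subject line space). The Jaccard similarity, the threshold value of 0.6, and the greedy comparison against the first member of each cluster are assumptions chosen for brevity; they are not the clustering technique of the disclosure, which may instead use, for example, locality sensitive hashing.

```python
def similarity(desc_a, desc_b):
    """Jaccard similarity between two hash descriptions (assumed measure)."""
    a, b = set(desc_a), set(desc_b)
    return len(a & b) / len(a | b) if (a or b) else 1.0


def cluster_descriptions(descriptions, threshold=0.6):
    """Greedily cluster message ids by similarity of their hash descriptions.

    descriptions: dict mapping message id -> hash description (tuple of tokens).
    Each message joins the first cluster whose first member is similar enough;
    otherwise it starts a new cluster.
    """
    clusters = []
    for message_id, desc in descriptions.items():
        for members in clusters:
            if similarity(desc, descriptions[members[0]]) >= threshold:
                members.append(message_id)
                break
        else:
            clusters.append([message_id])
    return clusters


subject_descriptions = {
    "m1": ("win", "prize", "now", "free", "click"),
    "m2": ("win", "prize", "now", "free", "offer"),
    "m3": ("meeting", "agenda", "monday", "room", "notes"),
}
subject_clusters = cluster_descriptions(subject_descriptions)
```

Running the same procedure separately over descriptions from other spaces (message body content, XML DOM structure, etc.) would assign a single message to several clusters, one or more per space.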

In an example, messages, such as a first message, a fifth message, a ninth message, and/or other messages, associated with hash descriptions of subject lines that are within a similarity threshold may be clustered into a first cluster 542. Messages, such as the first message, a second message, an eleventh message, and/or other messages, associated with hash descriptions of XML DOM structures that are within a similarity threshold may be clustered into an nth cluster 546.

Cluster features may be computed for the clusters based upon features of messages within such clusters. For example, first cluster features 544 may be computed for the first cluster 542 based upon features of the first message, the fifth message, the ninth message, and/or other messages within the first cluster 542 (e.g., user action features, recipient characteristic features, spam scores, subject line features, message content features, etc. may be aggregated together to compute the first cluster features 544). Nth cluster features 548 may be computed for the nth cluster 546 based upon features of the first message, the second message, the eleventh message, and/or other messages within the nth cluster 546 (e.g., user action features, recipient characteristic features, spam scores, subject line features, message content features, etc. may be aggregated together to compute the nth cluster features 548).

FIG. 5D illustrates a table creation functionality module 550 configured to create a first table 552 and a second table 554 based upon the set of clusters 540. The first table 552 may be populated with cluster entries for clusters within the set of clusters 540. For example, a first cluster entry 551 may be created for the first cluster 542 of the set of clusters 540, a second cluster entry may be created for a second cluster of the set of clusters 540, etc. The first cluster entry 551 may comprise the first cluster features 544 of the first cluster 542. The second cluster entry may comprise second cluster features of the second cluster.

The second table 554 may be populated with message entries corresponding to the set of messages. For example, a first message entry 555 may be created for the first message, a second message entry may be created for the second message, etc. The first message entry 555 may comprise identifiers of clusters to which the first message is assigned, such as to the first cluster 542, a fifth cluster, and the nth cluster 546. The second message entry may comprise identifiers of clusters to which the second message is assigned, such as to a seventh cluster and the nth cluster 546. In this way, the second table 554 may be queried using a message identifier to identify a message entry comprising identifiers of clusters to which the message is assigned. The identifiers of the clusters may be used to query the first table 552 in order to identify cluster features of the clusters to which the message is assigned so that the cluster features can be used for creating message features for the message.

FIG. 5E illustrates the trained message classifier 512 classifying messages and/or identifying spam campaigns. For example, the trained message classifier 512 may have been trained to classify messages based upon message features, such as described in FIG. 5A. For example, the trained message classifier 512 may utilize the learned decision rules 514 to retroactively classify messages within a message inbox 560. The trained message classifier 512 may evaluate a message within the message inbox 560 by clustering the message with other messages based upon message features of the messages. The first table 552, comprising cluster entries corresponding to clusters and cluster features of clusters, and the second table 554, corresponding to messages and identifiers of clusters to which such messages are assigned, may be created based upon the clusters of messages. Message features of the message may be created using the first table 552 and the second table 554. For example, the second table 554 may be queried using a message identifier of the message to identify a message entry comprising identifiers of clusters to which the message is assigned. The identifiers of the clusters may be used to query the first table 552 in order to identify cluster features of the clusters to which the message is assigned. The cluster features and features of the message are used to create message features for the message.

The trained message classifier 512 uses the learned decision rules 514 to classify the message, such as classifying 556 the message as spam (e.g., the message features may be indicative of features of spam messages). The message may be moved from the message inbox 560 to a spam folder. The trained message classifier 512 may utilize the cluster of messages, the first table 552, and/or the second table 554 to identify characteristics of a spam campaign 558.

FIG. 6 is an illustration of a scenario 600 involving an example non-transitory machine readable medium 602. The non-transitory machine readable medium 602 may comprise processor-executable instructions 612 that when executed by a processor 616 cause performance (e.g., by the processor 616) of at least some of the provisions herein. The non-transitory machine readable medium 602 may comprise a memory semiconductor (e.g., a semiconductor utilizing static random access memory (SRAM), dynamic random access memory (DRAM), and/or synchronous dynamic random access memory (SDRAM) technologies), a platter of a hard disk drive, a flash memory device, or a magnetic or optical disc (such as a compact disk (CD), a digital versatile disk (DVD), or floppy disk). The example non-transitory machine readable medium 602 stores computer-readable data 604 that, when subjected to reading 606 by a reader 610 of a device 608 (e.g., a read head of a hard disk drive, or a read operation invoked on a solid-state storage device), express the processor-executable instructions 612. In some embodiments, the processor-executable instructions 612, when executed, cause performance of operations, such as at least some of the example method 400 of FIG. 4, for example. In some embodiments, the processor-executable instructions 612 are configured to cause implementation of a system, such as at least some of the example system 500 of FIGS. 5A-5E, for example.

3. Usage of Terms

As used in this application, “component,” “module,” “system,” “interface,” and/or the like are generally intended to refer to a computer-related entity, either hardware, a combination of hardware and software, software, or software in execution. For example, a component may be, but is not limited to being, a process running on a processor, a processor, an object, an executable, a thread of execution, a program, and/or a computer. By way of illustration, both an application running on a controller and the controller can be a component. One or more components may reside within a process and/or thread of execution and a component may be localized on one computer and/or distributed between two or more computers.

Unless specified otherwise, “first,” “second,” and/or the like are not intended to imply a temporal aspect, a spatial aspect, an ordering, etc. Rather, such terms are merely used as identifiers, names, etc. for features, elements, items, etc. For example, a first object and a second object generally correspond to object A and object B or two different or two identical objects or the same object.

Moreover, “example” is used herein to mean serving as an example, instance, illustration, etc., and not necessarily as advantageous. As used herein, “or” is intended to mean an inclusive “or” rather than an exclusive “or”. In addition, “a” and “an” as used in this application are generally to be construed to mean “one or more” unless specified otherwise or clear from context to be directed to a singular form. Also, at least one of A and B and/or the like generally means A or B or both A and B. Furthermore, to the extent that “includes”, “having”, “has”, “with”, and/or variants thereof are used in either the detailed description or the claims, such terms are intended to be inclusive in a manner similar to the term “comprising”.

Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing at least some of the claims.

Furthermore, the claimed subject matter may be implemented as a method, apparatus, or article of manufacture using standard programming and/or engineering techniques to produce software, firmware, hardware, or any combination thereof to control a computer to implement the disclosed subject matter. The term “article of manufacture” as used herein is intended to encompass a computer program accessible from any computer-readable device, carrier, or media. Of course, many modifications may be made to this configuration without departing from the scope or spirit of the claimed subject matter.

Various operations of embodiments are provided herein. In an embodiment, one or more of the operations described may constitute computer readable instructions stored on one or more computer readable media, which if executed by a computing device, will cause the computing device to perform the operations described. The order in which some or all of the operations are described should not be construed as to imply that these operations are necessarily order dependent. Alternative ordering will be appreciated by one skilled in the art having the benefit of this description. Further, it will be understood that not all operations are necessarily present in each embodiment provided herein. Also, it will be understood that not all operations are necessary in some embodiments.

Also, although the disclosure has been shown and described with respect to one or more implementations, equivalent alterations and modifications will occur to others skilled in the art based upon a reading and understanding of this specification and the annexed drawings. The disclosure includes all such modifications and alterations and is limited only by the scope of the following claims. In particular regard to the various functions performed by the above described components (e.g., elements, resources, etc.), the terms used to describe such components are intended to correspond, unless otherwise indicated, to any component which performs the specified function of the described component (e.g., that is functionally equivalent), even though not structurally equivalent to the disclosed structure. In addition, while a particular feature of the disclosure may have been disclosed with respect to only one of several implementations, such feature may be combined with one or more other features of the other implementations as may be desired and advantageous for any given or particular application.

What is claimed is:
 1. A method of message classification, the method comprising: executing, on a processor of a computing device, instructions that cause the computing device to perform operations, the operations comprising: clustering messages into clusters based upon hash descriptions of bags of words corresponding to message components of the messages; computing cluster features for the clusters based upon one or more features of one or more messages within the clusters; creating a first table comprising cluster entries for the clusters, wherein a first cluster entry of a first cluster is populated with one or more cluster features derived from one or more message components of at least one message clustered within the first cluster; creating a second table populated with message entries for the messages, wherein a first message entry, of the message entries in the second table, for a first message is populated with identifiers of one or more clusters into which the first message is clustered; querying the first table and the second table to create one or more message features for a second message based upon one or more second features of the second message and one or more second cluster features of one or more second clusters to which the second message is assigned; training a message classifier to use a learned decision rule that is trained using a set of training message data labeled based upon one or more spam indicators; balancing a number of positive spam training message examples with a number of negative non-spam training message examples, in association with the message classifier, by subsampling a ratio of negative non-spam training message examples; and classifying the second message based upon the one or more message features and the learned decision rule of the message classifier.
 2. The method of claim 1, wherein the computing cluster features comprises: measuring a distance between a first bag of words and a second bag of words within a cluster to create a cluster feature.
 3. The method of claim 1, wherein the computing cluster features comprises: measuring a number of messages within a cluster to create a cluster feature.
 4. The method of claim 1, wherein the computing cluster features comprises: aggregating spam scores for messages within a cluster, recipient characteristic features of recipients of the messages within the cluster, user action features performed upon the messages within the cluster by the recipients, message content features of the messages within the cluster, and subject line features of the messages within the cluster to create the cluster features.
 5. The method of claim 1, wherein the clustering comprises: clustering the messages within a sender description space, a subject line space, a message body content space, a user action space, and an extensible markup language (XML) document object model (DOM) structure space for message body content.
 6. The method of claim 1, wherein the clustering comprises: clustering the messages based upon a user read action feature, a user reply action feature, a user forward action feature, a user delete action feature, and a user spam vote feature.
 7. The method of claim 1, comprising: transforming, using a hash function, a bag of words into a min-hash description based upon a number of words having a minimum hash value.
 8. The method of claim 1, comprising: retroactively classifying the second message after the second message was delivered to a message inbox.
 9. The method of claim 8, comprising: responsive to classifying the second message as spam, moving the second message from the message inbox to a spam folder.
 10. The method of claim 1, wherein the clustering comprises: utilizing a locality sensitive hashing technique for clustering the messages.
 11. The method of claim 1, wherein the clustering comprises: clustering the first message into both the first cluster and into a second cluster.
 12. The method of claim 1, comprising: if a training message receives a spam vote spam indicator within a threshold timespan, then designating the training message as spam, otherwise, designating the training message as not spam.
 13. The method of claim 11, wherein identifiers of the first cluster and the second cluster are populated within the first message entry.
 14. The method of claim 1, comprising: using the message classifier to identify a spam campaign.
 15. A computing device comprising: a processor; and memory comprising processor-executable instructions that when executed by the processor cause performance of operations, the operations comprising: clustering messages into clusters based upon features of the messages; computing cluster features for the clusters based upon one or more features of one or more messages within the clusters; creating a first table comprising cluster entries for the clusters, wherein a first cluster entry of a first cluster is populated with one or more cluster features derived from one or more message components of at least one message clustered within the first cluster; creating a second table populated with message entries for the messages, wherein a first message entry, of the message entries in the second table, for a first message is populated with identifiers of one or more clusters into which the first message is clustered; querying the first table and the second table to create one or more message features for a second message based upon one or more second features of the second message and one or more second cluster features of one or more second clusters to which the second message is assigned; training a message classifier to use a learned decision rule that is trained using a set of training message data labeled based upon one or more spam indicators; balancing a number of positive spam training message examples with a number of negative non-spam training message examples, in association with the message classifier, by subsampling a ratio of negative non-spam training message examples; and classifying the second message based upon the one or more message features and the learned decision rule of the message classifier.
 16. The computing device of claim 15, wherein the one or more message features of the second message are created based upon a number of messages within a cluster to which the second message is assigned, user actions upon messages within the cluster, subject line features of the messages within the cluster, sender description features of senders of the messages within the cluster, word content of message bodies of the messages within the cluster, message body structures of the messages within the cluster, and spam filtering scores for the messages within the cluster.
 17. The computing device of claim 15, wherein the operations comprise: generating a message feature based upon an aggregate of features of two or more clusters to which the second message is assigned.
 18. A non-transitory machine readable medium having stored thereon processor-executable instructions that when executed cause performance of operations, the operations comprising: clustering messages into clusters; computing cluster features for the clusters based upon one or more features of one or more messages within the clusters; creating a first table comprising cluster entries for the clusters, wherein a first cluster entry of a first cluster is populated with one or more cluster features derived from one or more message components of at least one message clustered within the first cluster; creating a second table populated with message entries for the messages, wherein a first message entry, of the message entries in the second table, for a first message is populated with identifiers of one or more clusters into which the first message is clustered; querying the first table and the second table to create one or more message features for a second message based upon one or more second features of the second message and one or more second cluster features of one or more second clusters to which the second message is assigned; training a message classifier to use a learned decision rule that is trained using a set of training message data labeled based upon one or more spam indicators; balancing a number of positive spam training message examples with a number of negative non-spam training message examples, in association with the message classifier, by subsampling a ratio of negative non-spam training message examples; and classifying the second message based upon the one or more message features and the learned decision rule of the message classifier.
 19. The non-transitory machine readable medium of claim 18, wherein the computing cluster features comprises: measuring a distance between a first bag of words and a second bag of words within a cluster to create a cluster feature.
 20. The non-transitory machine readable medium of claim 18, wherein the operations comprise: retroactively classifying the second message after the second message was delivered to a message inbox.